{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and store STOOQ data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook contains information on downloading the STOOQ stock andETF price data that we use in [Chapter 09](../09_time_series_models) for a pairs trading strategy based on cointegration and [Chapter 11](../11_decision_trees_random_forests) for a long-short strategy using Random Forest return predictions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:32:07.092623Z",
     "start_time": "2020-06-18T14:32:07.090885Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-18T14:32:07.263130Z",
     "start_time": "2020-06-18T14:32:07.259861Z"
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import requests\n",
    "from io import BytesIO\n",
    "from zipfile import ZipFile, BadZipFile\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pandas_datareader.data as web\n",
    "from sklearn.datasets import fetch_openml\n",
    "\n",
    "pd.set_option('display.expand_frame_repr', False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Data Store path"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modify path if you would like to store the data elsewhere and change the notebooks accordingly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:27:54.832609Z",
     "start_time": "2020-06-19T02:27:54.824778Z"
    }
   },
   "outputs": [],
   "source": [
    "DATA_STORE = Path('assets.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stooq Historical Market Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Note that the below downloading details may change at any time as Stooq updates their website; if you encounter errors, please inspect their website and raise a GitHub issue to let us know so we can update the information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download price data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Download **price data** for the selected combination of asset class, market and frequency from [the Stooq website](https://stooq.com/db/h/)\n",
    "2. Store the result under `stooq` using the preferred folder structure outlined on the website. It has the structure: `/data/freq/market/asset_class`, such as `/data/daily/us/nasdaq etfs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:17:23.456087Z",
     "start_time": "2020-06-19T02:17:23.449798Z"
    }
   },
   "outputs": [],
   "source": [
    "stooq_path = Path('stooq') \n",
    "if not stooq_path.exists():\n",
    "    stooq_path.mkdir()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:17:23.882490Z",
     "start_time": "2020-06-19T02:17:23.879665Z"
    }
   },
   "outputs": [],
   "source": [
    "market = 'us'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:17:26.841192Z",
     "start_time": "2020-06-19T02:17:26.835288Z"
    }
   },
   "outputs": [],
   "source": [
    "STOOQ_URL = 'https://static.stooq.com/db/h/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:22:53.981479Z",
     "start_time": "2020-06-19T02:22:53.975206Z"
    }
   },
   "outputs": [],
   "source": [
    "data_url = 'd_jp_txt.zip'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:23:01.882835Z",
     "start_time": "2020-06-19T02:22:54.301837Z"
    }
   },
   "outputs": [],
   "source": [
    "response = requests.get(STOOQ_URL + data_url).content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:23:15.379159Z",
     "start_time": "2020-06-19T02:23:01.883773Z"
    }
   },
   "outputs": [],
   "source": [
    "with ZipFile(BytesIO(response)) as zip_file:\n",
    "    for i, file in enumerate(zip_file.namelist()):\n",
    "        if not file.endswith('.txt'):\n",
    "            continue\n",
    "        local_file = stooq_path / file\n",
    "        local_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "        with local_file.open('wb') as output:\n",
    "            for line in zip_file.open(file).readlines():\n",
    "                output.write(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add symbols"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-18T20:43:47.976210Z",
     "start_time": "2020-04-18T20:43:47.968770Z"
    }
   },
   "source": [
    "Add the corresponding **symbols**, i.e., tickers and names by following the directory tree on the same site. You can also adapt the following code snippet using the appropriate asset code that you find by inspecting the url; this example works for NASDAQ ETFs that have code `g=69`:\n",
    "```python\n",
    "df = pd.read_csv('https://stooq.com/db/l/?g=69', sep='        ').apply(lambda x: x.str.strip())\n",
    "df.columns = ['ticker', 'name']\n",
    "df.drop_duplicates('ticker').to_csv('stooq/data/tickers/us/nasdaq etfs.csv', index=False)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T01:30:28.894603Z",
     "start_time": "2020-06-19T01:30:28.885132Z"
    }
   },
   "outputs": [],
   "source": [
    "metadata_dict = {\n",
    "    ('jp', 'tse etfs'): 34,\n",
    "    ('jp', 'tse stocks'): 32,\n",
    "    ('us', 'nasdaq etfs'): 69,\n",
    "    ('us', 'nasdaq stocks'): 27,\n",
    "    ('us', 'nyse etfs'): 70,\n",
    "    ('us', 'nyse stocks'): 28,\n",
    "    ('us', 'nysemkt stocks'): 26\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T01:36:26.300430Z",
     "start_time": "2020-06-19T01:36:22.212007Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "jp tse etfs # tickers: 317\n",
      "jp tse stocks # tickers: 3,715\n",
      "us nasdaq etfs # tickers: 184\n",
      "us nasdaq stocks # tickers: 3,449\n",
      "us nyse etfs # tickers: 1,022\n",
      "us nyse stocks # tickers: 3,918\n",
      "us nysemkt stocks # tickers: 301\n"
     ]
    }
   ],
   "source": [
    "for (market, asset_class), code in metadata_dict.items():\n",
    "    df = pd.read_csv(f'https://stooq.com/db/l/?g={code}', sep='        ').apply(lambda x: x.str.strip())\n",
    "    df.columns = ['ticker', 'name']\n",
    "    df = df.drop_duplicates('ticker').dropna()\n",
    "    print(market, asset_class, f'# tickers: {df.shape[0]:,.0f}')\n",
    "    path = stooq_path / 'tickers' / market\n",
    "    if not path.exists():\n",
    "        path.mkdir(parents=True)\n",
    "    df.to_csv(path / f'{asset_class}.csv', index=False)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Store price data in HDF5 format"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To speed up loading, we store the price data in HDF format. The function `get_stooq_prices_and_symbols` loads data assuming the directory structure described above and takes the following arguments:\n",
    "- frequency (see Stooq website for options as these may change; default is `daily`\n",
    "- market (default: `us`), and \n",
    "- asset class (default: `nasdaq etfs`.\n",
    "\n",
    "It removes files that do not have data or do not appear in the corresponding list of symbols."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:28:05.958722Z",
     "start_time": "2020-06-19T02:28:05.940267Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_stooq_prices_and_tickers(frequency='daily',\n",
    "                                 market='us',\n",
    "                                 asset_class='nasdaq etfs'):\n",
    "    prices = []\n",
    "    \n",
    "    tickers = (pd.read_csv(stooq_path / 'tickers' / market / f'{asset_class}.csv'))\n",
    "\n",
    "    if frequency in ['5 min', 'hourly']:\n",
    "        parse_dates = [['date', 'time']]\n",
    "        date_label = 'date_time'\n",
    "    else:\n",
    "        parse_dates = ['date']\n",
    "        date_label = 'date'\n",
    "    names = ['ticker', 'freq', 'date', 'time', \n",
    "             'open', 'high', 'low', 'close','volume', 'openint']\n",
    "    \n",
    "    usecols = ['ticker', 'open', 'high', 'low', 'close', 'volume'] + parse_dates\n",
    "    path = stooq_path / 'data' / frequency / market / asset_class\n",
    "    print(path.as_posix())\n",
    "    files = path.glob('**/*.txt')\n",
    "    for i, file in enumerate(files, 1):\n",
    "        if i % 500 == 0:\n",
    "            print(i)\n",
    "        if file.stem not in set(tickers.ticker.str.lower()):\n",
    "            print(file.stem, 'not available')\n",
    "            file.unlink()\n",
    "        else:\n",
    "            try:\n",
    "                df = (pd.read_csv(\n",
    "                    file,\n",
    "                    names=names,\n",
    "                    usecols=usecols,\n",
    "                    header=0,\n",
    "                    parse_dates=parse_dates))\n",
    "                prices.append(df)\n",
    "            except pd.errors.EmptyDataError:\n",
    "                print('\\tdata missing', file.stem)\n",
    "                file.unlink()\n",
    "\n",
    "    prices = (pd.concat(prices, ignore_index=True)\n",
    "              .rename(columns=str.lower)\n",
    "              .set_index(['ticker', date_label])\n",
    "              .apply(lambda x: pd.to_numeric(x, errors='coerce')))\n",
    "    return prices, tickers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll be using US equities and ETFs in [Chapter 9](../09_time_series_models) and and Japanese equities in [Chapter 11](../11_decision_trees_random_forests). The following code collects the price data for the period 2000-2019 and stores it with the corresponding symbols in the global `assets.h5` store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-19T02:30:25.079270Z",
     "start_time": "2020-06-19T02:28:06.594886Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "tse stocks\n",
      "stooq/data/daily/jp/tse stocks\n",
      "500\n",
      "1000\n",
      "1500\n",
      "2000\n",
      "2500\n",
      "3000\n",
      "3500\n",
      "\n",
      "No. of observations per asset\n",
      "count    3680.000000\n",
      "mean     2805.153533\n",
      "std      1175.687342\n",
      "min         1.000000\n",
      "25%      2149.500000\n",
      "50%      3041.000000\n",
      "75%      3621.000000\n",
      "max      4905.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 10322965 entries, ('1301.JP', Timestamp('2005-03-22 00:00:00')) to ('9997.JP', Timestamp('2019-12-30 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count     Dtype  \n",
      "---  ------  --------------     -----  \n",
      " 0   open    10322965 non-null  float64\n",
      " 1   high    10322965 non-null  float64\n",
      " 2   low     10322965 non-null  float64\n",
      " 3   close   10322965 non-null  float64\n",
      " 4   volume  10322965 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 433.2+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3715 entries, 0 to 3714\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  3715 non-null   object\n",
      " 1   name    3715 non-null   object\n",
      "dtypes: object(2)\n",
      "memory usage: 58.2+ KB\n",
      "None\n",
      "\n",
      "nasdaq etfs\n",
      "stooq/data/daily/us/nasdaq etfs\n",
      "\n",
      "No. of observations per asset\n",
      "count     184.000000\n",
      "mean     2091.184783\n",
      "std       871.680544\n",
      "min        74.000000\n",
      "25%      1462.250000\n",
      "50%      2062.000000\n",
      "75%      2542.250000\n",
      "max      5031.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 384778 entries, ('AAXJ.US', Timestamp('2008-08-15 00:00:00')) to ('ZIV.US', Timestamp('2019-12-31 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count   Dtype  \n",
      "---  ------  --------------   -----  \n",
      " 0   open    384778 non-null  float64\n",
      " 1   high    384778 non-null  float64\n",
      " 2   low     384778 non-null  float64\n",
      " 3   close   384778 non-null  float64\n",
      " 4   volume  384778 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 16.2+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 184 entries, 0 to 183\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  184 non-null    object\n",
      " 1   name    184 non-null    object\n",
      "dtypes: object(2)\n",
      "memory usage: 3.0+ KB\n",
      "None\n",
      "\n",
      "nasdaq stocks\n",
      "stooq/data/daily/us/nasdaq stocks\n",
      "500\n",
      "1000\n",
      "1500\n",
      "2000\n",
      "2500\n",
      "3000\n",
      "\n",
      "No. of observations per asset\n",
      "count    3236.000000\n",
      "mean     1996.798208\n",
      "std      1509.793004\n",
      "min         1.000000\n",
      "25%       513.750000\n",
      "50%      1622.500000\n",
      "75%      3716.250000\n",
      "max      5031.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 6461639 entries, ('AACG.US', Timestamp('2008-01-28 00:00:00')) to ('ZYXI.US', Timestamp('2019-12-31 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count    Dtype  \n",
      "---  ------  --------------    -----  \n",
      " 0   open    6461639 non-null  float64\n",
      " 1   high    6461639 non-null  float64\n",
      " 2   low     6461639 non-null  float64\n",
      " 3   close   6461639 non-null  float64\n",
      " 4   volume  6461639 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 271.2+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3449 entries, 0 to 3448\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  3449 non-null   object\n",
      " 1   name    3449 non-null   object\n",
      "dtypes: object(2)\n",
      "memory usage: 54.0+ KB\n",
      "None\n",
      "\n",
      "nyse etfs\n",
      "stooq/data/daily/us/nyse etfs\n",
      "fmil.us not available\n",
      "dmay.us not available\n",
      "fbcv.us not available\n",
      "500\n",
      "fmay.us not available\n",
      "fbcg.us not available\n",
      "1000\n",
      "\n",
      "No. of observations per asset\n",
      "count    1000.000000\n",
      "mean     2470.623000\n",
      "std      1000.082459\n",
      "min         2.000000\n",
      "25%      1692.500000\n",
      "50%      2580.000000\n",
      "75%      3325.000000\n",
      "max      3738.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 2470623 entries, ('AADR.US', Timestamp('2010-07-21 00:00:00')) to ('ZSL.US', Timestamp('2019-12-31 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count    Dtype  \n",
      "---  ------  --------------    -----  \n",
      " 0   open    2470623 non-null  float64\n",
      " 1   high    2470623 non-null  float64\n",
      " 2   low     2470623 non-null  float64\n",
      " 3   close   2470623 non-null  float64\n",
      " 4   volume  2470623 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 103.7+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1022 entries, 0 to 1021\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  1022 non-null   object\n",
      " 1   name    1022 non-null   object\n",
      "dtypes: object(2)\n",
      "memory usage: 16.1+ KB\n",
      "None\n",
      "\n",
      "nyse stocks\n",
      "stooq/data/daily/us/nyse stocks\n",
      "ftai_a.us not available\n",
      "glog_a.us not available\n",
      "ccac-ws.us not available\n",
      "acel-ws.us not available\n",
      "jcap_b.us not available\n",
      "chmi_a.us not available\n",
      "cfcv.us not available\n",
      "clny_h.us not available\n",
      "gmre_a.us not available\n",
      "clny_i.us not available\n",
      "dlng_b.us not available\n",
      "cubi_d.us not available\n",
      "gleo-ws.us not available\n",
      "cubi_f.us not available\n",
      "fvac-u.us not available\n",
      "iret_c.us not available\n",
      "alin_e.us not available\n",
      "500\n",
      "celg-r.us not available\n",
      "atco_i.us not available\n",
      "ccxx-ws.us not available\n",
      "aptv_a.us not available\n",
      "grx_a-cl.us not available\n",
      "ambc-ws.us not available\n",
      "hsbc_a.us not available\n",
      "alin_b.us not available\n",
      "cwen-a.us not available\n",
      "codi_b.us not available\n",
      "banc_e.us not available\n",
      "ipoc-u.us not available\n",
      "feac-u.us not available\n",
      "altg-ws.us not available\n",
      "hfro_a.us not available\n",
      "cubi_e.us not available\n",
      "cmre_e.us not available\n",
      "ipoc-ws.us not available\n",
      "banc_d.us not available\n",
      "ipv-ws.us not available\n",
      "alus-u.us not available\n",
      "1000\n",
      "chmi_b.us not available\n",
      "codi_c.us not available\n",
      "alus-ws.us not available\n",
      "dmyt-ws.us not available\n",
      "fpac-u.us not available\n",
      "cmre_d.us not available\n",
      "glop_b.us not available\n",
      "carr-w.us not available\n",
      "atco_h.us not available\n",
      "cch-ws.us not available\n",
      "feac-ws.us not available\n",
      "iipr_a.us not available\n",
      "hmlp_a.us not available\n",
      "insw_a.us not available\n",
      "aig-ws.us not available\n",
      "glop_a.us not available\n",
      "codi_a.us not available\n",
      "cmre_c.us not available\n",
      "ally_a.us not available\n",
      "bmrg-u.us not available\n",
      "graf-ws.us not available\n",
      "gix-ws.us not available\n",
      "gleo-u.us not available\n",
      "1500\n",
      "clny_g.us not available\n",
      "dxb-cl.us not available\n",
      "atco_e.us not available\n",
      "clny_j.us not available\n",
      "fpac-ws.us not available\n",
      "atco_g.us not available\n",
      "avtr_a.us not available\n",
      "ccx-ws.us not available\n",
      "bmrg-ws.us not available\n",
      "dfns-ws.us not available\n",
      "cubi_c.us not available\n",
      "cmre_b.us not available\n",
      "ccxx-u.us not available\n",
      "dmyt-u.us not available\n",
      "bcsf-r.us not available\n",
      "ares_a.us not available\n",
      "glop_c.us not available\n",
      "atco_d.us not available\n",
      "isg-cl.us not available\n",
      "dfns-u.us not available\n",
      "bcsfr-w.us not available\n",
      "graf-u.us not available\n",
      "ftai_b.us not available\n",
      "ccac-u.us not available\n",
      "2000\n",
      "ipob-u.us not available\n",
      "alin_a.us not available\n",
      "corr_a.us not available\n",
      "dlng_a.us not available\n",
      "mfac-u.us not available\n",
      "tdw-ws-b.us not available\n",
      "pack-ws.us not available\n",
      "schw_c.us not available\n",
      "nycb_u.us not available\n",
      "otis-w.us not available\n",
      "splp_a.us not available\n",
      "prif_d.us not available\n",
      "mitt_c.us not available\n",
      "pcpl-u.us not available\n",
      "rmg-ws.us not available\n",
      "pcpl-ws.us not available\n",
      "trtn_d.us not available\n",
      "rexr_b.us not available\n",
      "rpla-u.us not available\n",
      "sbe-ws.us not available\n",
      "2500\n",
      "pic-ws.us not available\n",
      "star_i.us not available\n",
      "star_g.us not available\n",
      "oibr-c.us not available\n",
      "msge-w.us not available\n",
      "jih-ws.us not available\n",
      "mck-wd.us not available\n",
      "sftw-ws.us not available\n",
      "ngls_a.us not available\n",
      "soac-u.us not available\n",
      "star_d.us not available\n",
      "stag_c.us not available\n",
      "sitc_a.us not available\n",
      "voya_b.us not available\n",
      "psa_v-cl.us not available\n",
      "3000\n",
      "scpe-u.us not available\n",
      "prif_c.us not available\n",
      "vert-u.us not available\n",
      "vst-ws-a.us not available\n",
      "lgc-ws.us not available\n",
      "trne-ws.us not available\n",
      "nycb_a.us not available\n",
      "prif_f.us not available\n",
      "msgs-w.us not available\n",
      "trtn_a.us not available\n",
      "loak-u.us not available\n",
      "rpla-ws.us not available\n",
      "tdw-ws-a.us not available\n",
      "trne-u.us not available\n",
      "lgvw-u.us not available\n",
      "vvnt-ws.us not available\n",
      "scpe-ws.us not available\n",
      "3500\n",
      "shll-u.us not available\n",
      "oac-ws.us not available\n",
      "scvx-ws.us not available\n",
      "spaq-ws.us not available\n",
      "vrt-ws.us not available\n",
      "loak-ws.us not available\n",
      "prif_e.us not available\n",
      "rnr_c-cl.us not available\n",
      "rexr_a.us not available\n",
      "nfh-ws.us not available\n",
      "mfac-ws.us not available\n",
      "nsco-ws.us not available\n",
      "scvx-u.us not available\n",
      "trtn_c.us not available\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trtn_b.us not available\n",
      "shll-ws.us not available\n",
      "spaq-u.us not available\n",
      "prif_b.us not available\n",
      "rexr_c.us not available\n",
      "mitt_b.us not available\n",
      "prif_a.us not available\n",
      "sitc_k.us not available\n",
      "4000\n",
      "sftw-u.us not available\n",
      "mitt_a.us not available\n",
      "schw_d.us not available\n",
      "lhc-ws.us not available\n",
      "\n",
      "No. of observations per asset\n",
      "count    3775.000000\n",
      "mean     2100.693775\n",
      "std      1578.140734\n",
      "min         1.000000\n",
      "25%       579.500000\n",
      "50%      1742.000000\n",
      "75%      3737.000000\n",
      "max      5031.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 7930119 entries, ('A.US', Timestamp('2000-01-03 00:00:00')) to ('ZYME.US', Timestamp('2019-12-31 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count    Dtype  \n",
      "---  ------  --------------    -----  \n",
      " 0   open    7930119 non-null  float64\n",
      " 1   high    7930119 non-null  float64\n",
      " 2   low     7930119 non-null  float64\n",
      " 3   close   7930119 non-null  float64\n",
      " 4   volume  7930119 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 332.8+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 3918 entries, 0 to 3917\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  3918 non-null   object\n",
      " 1   name    3918 non-null   object\n",
      "dtypes: object(2)\n",
      "memory usage: 61.3+ KB\n",
      "None\n",
      "\n",
      "nysemkt stocks\n",
      "stooq/data/daily/us/nysemkt stocks\n",
      "uuuu-ws.us not available\n",
      "ycbd_a.us not available\n",
      "phge-ws.us not available\n",
      "brmk-ws.us not available\n",
      "biox-ws.us not available\n",
      "plym_a.us not available\n",
      "talo-ws.us not available\n",
      "phge-u.us not available\n",
      "klr-ws.us not available\n",
      "nspr-ws.us not available\n",
      "nspr-ws-b.us not available\n",
      "tdw-ws.us not available\n",
      "chaq-u.us not available\n",
      "\n",
      "No. of observations per asset\n",
      "count     296.000000\n",
      "mean     2600.081081\n",
      "std      1272.414385\n",
      "min         4.000000\n",
      "25%      1535.000000\n",
      "50%      3293.500000\n",
      "75%      3717.250000\n",
      "max      5029.000000\n",
      "dtype: float64\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "MultiIndex: 769624 entries, ('AAMC.US', Timestamp('2012-12-13 00:00:00')) to ('ZOM.US', Timestamp('2019-12-31 00:00:00'))\n",
      "Data columns (total 5 columns):\n",
      " #   Column  Non-Null Count   Dtype  \n",
      "---  ------  --------------   -----  \n",
      " 0   open    769624 non-null  float64\n",
      " 1   high    769624 non-null  float64\n",
      " 2   low     769624 non-null  float64\n",
      " 3   close   769624 non-null  float64\n",
      " 4   volume  769624 non-null  int64  \n",
      "dtypes: float64(4), int64(1)\n",
      "memory usage: 32.3+ MB\n",
      "None\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 301 entries, 0 to 300\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count  Dtype \n",
      "---  ------  --------------  ----- \n",
      " 0   ticker  301 non-null    object\n",
      " 1   name    301 non-null    object\n",
      "dtypes: object(2)\n",
      "memory usage: 4.8+ KB\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "# load some Japanese and all US assets for 2000-2019\n",
    "markets = {'jp': ['tse stocks'],\n",
    "           'us': ['nasdaq etfs', 'nasdaq stocks', 'nyse etfs', 'nyse stocks', 'nysemkt stocks']\n",
    "          }\n",
    "frequency = 'daily'\n",
    "\n",
    "idx = pd.IndexSlice\n",
    "for market, asset_classes in markets.items():\n",
    "    for asset_class in asset_classes:\n",
    "        print(f'\\n{asset_class}')\n",
    "        prices, tickers = get_stooq_prices_and_tickers(frequency=frequency, \n",
    "                                                       market=market, \n",
    "                                                       asset_class=asset_class)\n",
    "        \n",
    "        prices = prices.sort_index().loc[idx[:, '2000': '2019'], :]\n",
    "        names = prices.index.names\n",
    "        prices = (prices\n",
    "                  .reset_index()\n",
    "                  .drop_duplicates()\n",
    "                  .set_index(names)\n",
    "                  .sort_index())\n",
    "        \n",
    "        print('\\nNo. of observations per asset')\n",
    "        print(prices.groupby('ticker').size().describe())\n",
    "        key = f'stooq/{market}/{asset_class.replace(\" \", \"/\")}/'\n",
    "        \n",
    "        print(prices.info(null_counts=True))\n",
    "        \n",
    "        prices.to_hdf(DATA_STORE, key + 'prices', format='t')\n",
    "        \n",
    "        print(tickers.info())\n",
    "        tickers.to_hdf(DATA_STORE, key + 'tickers', format='t')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
